class: center, middle, inverse, title-slide # Statistics and such ## Learning from data ### Brendan Alexander ### 2020/7/28 (updated: 2020-08-25) --- background-image: url(https://upload.wikimedia.org/wikipedia/commons/6/69/EM_Clustering_of_Old_Faithful_data.gif) <style type="text/css"> # .remark-slide-content { # font-size: 28px; # padding: 20px 80px 20px 80px; # } # .remark-code, .remark-inline-code { # background: #f0f0f0; # } # .remark-code { # font-size: 24px; # } .huge .remark-code { /*Change made here*/ font-size: 200% !important; } .tiny .remark-code { /*Change made here*/ font-size: 70% !important; } </style> Image credit: [Wikimedia Commons](https://upload.wikimedia.org/wikipedia/commons/6/69/EM_Clustering_of_Old_Faithful_data.gif) - The EM algorithm at work - Definitely not classical statistics - Statistics, machine learning, and computer science have come together to create interesting solutions to data problems --- # Who am I? .pull-left[ - I work with population models that model population growth and inheritance. - Herbicide resistance/herbicide stewardship in waterhemp - Theoretical and empirical approaches MS Plant Science - (University of Alberta, November 2016) MS Applied Statistics - (University of Illinois Urbana-Champaign, August 2020) PhD Crop Science - not quite, but soon I hope... ] .pull-right[  ] --- # What is statistics? <!-- - Inverse probability (mostly a defunct definition, but I really like it) --> <!-- - Consider a 6 sided die. If it's a fair die then we know that the probability of rolling a **1** is `\(\frac{1}{6}\)` --> <!-- - But what if instead I tell you that I roll a die 5 times and the results are: 1, 1, 5, 20. What's the most likely type of die that I rolled? (What's the data-generating model?) 
--> - Statistics is all about learning from data - There are two main ways to learn: Inference and Prediction - There are *many* different ways to view statistics, but I think this is a pretty reasonable way to categorize things .pull-left[ **Inference:** Learning about mechanisms and relationships; building and understanding the biological meaning of models ] .pull-right[ **Prediction:** Using massive quantities of data to predict events (elections, sports, how many hamburgers to prepare ahead of a lunch rush) ] - The boundaries between these concepts are not black and white, this is just a general guide --- # Inference - Classical statistics, Frequentist, Likelihood-ist, Bayesian - Exploratory: Graphing, finding patterns, hypothesizing - Confirmatory: Model building, hypothesis tests - This is where most Agriculture data goes. - These are usually parametric problems (parametric defined later) - Examples - Graphs - Linear regression (fitting a line) - ANOVAs, ANCOVAs (still just linear regression but now with fancy names) - Generalized models (logistic regression, Poisson/beta regression) - Linear regression with a non-Gaussian error - Mixed models (some longitudinal, split plots, hierarchical designs) - Non-linear models (dose-response curves, anything that linear algebra can't represent) - Probability distributions --- # Parametric and non-parametric statistics **Parametric statistics** (tends to be used for inferential statistics, but not always) - We have an idea of what the probability distributions are, or we're at least willing to make some assumptions. 
- This is most statistics that we do in biology and agriculture, but there are important exceptions, especially with machine learning and prediction - Linear and non-linear models/Mixed models/Generalized Additive models (at least the error distribution) - Multivariate methods like PCA and Factor analysis **Non-parametric statistics** (tends to be used for predictive statistics, but not always) We don't know what the probability distributions are that our data represent and we are not willing to make assumptions - Rank methods/Kernel methods/Random forest - Machine learning methods tend to use non-parametric methods because the data sets are so large and have so many variables (features) that distributional assumptions can't easily be checked - Machine learning methods also tend to have prediction as the goal, not inference - In biology we tend to want inference first Both parametric and non-parametric methods can be used for inference, but parametric methods are more powerful (at the cost of making assumptions) --- .pull-left[ # Inferential: Exploratory analysis *Hypothesis generating* - This is an approach to data analysis that focuses on non-parametric methods of interpreting data - Non-parametric really just refers to methods that don't try to use data to estimate parameters from probability distributions or models (which we'll get to in a bit) - We're looking for trends, relationships, and/or patterns that we might like to investigate further - Graphing data is a major exploratory method ] .pull-right[ **`mtcars` data** ```r pairs(mtcars[,1:5]) ``` <!-- --> ] --- # Inferential: Confirmatory analysis *Hypothesis testing* .pull-left[ - Confirmatory analysis is the testing or evaluation of possible hypotheses/models using data. - These analyses are typically parametric, but not always. 
- These analyses almost always try to conduct hypothesis tests (not always correctly) $$ `\begin{aligned} H_0&: \beta_1=0\\ H_A&:\beta_1\neq0 \end{aligned}` $$ ```r fit <- lm(hp~cyl,mtcars) confint(fit) ``` ``` ## 2.5 % 97.5 % ## (Intercept) -102.0743 -0.03442479 ## cyl 24.0265 39.89006527 ``` ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-4-1.png" width="90%" /> ] --- # Lots of other output, what does it all mean? - There's way more output here than just the two parameter estimates for a linear function. - We aren't dealing with functions, we're dealing with distributions .tiny[ ```r summary(fit) ``` ``` ## ## Call: ## lm(formula = hp ~ cyl, data = mtcars) ## ## Residuals: ## Min 1Q Median 3Q Max ## -54.61 -25.99 -11.28 21.51 130.39 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -51.054 24.982 -2.044 0.0499 * ## cyl 31.958 3.884 8.229 3.48e-09 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 38.62 on 30 degrees of freedom ## Multiple R-squared: 0.693, Adjusted R-squared: 0.6827 ## F-statistic: 67.71 on 1 and 30 DF, p-value: 3.478e-09 ``` ] --- class: center # Functions vs. distributions .pull-left[ **Functions** No stochastic (random) component $$ `\begin{aligned} y&=\underbrace{\beta_0 +\beta_1x}_{deterministic}\\ \\ y&=2+x \end{aligned}` $$ <img src="index_files/figure-html/unnamed-chunk-6-1.png" width="50%" /> ] .pull-right[ **Distributions** $$ `\begin{aligned} y&=\underbrace{\beta_0 +\beta_1x}_{deterministic} + \underbrace{\epsilon}_{stochastic}\\ \\ y&=2+x + \epsilon, \:\:\:\: \epsilon\sim \mathcal{N(0,25)}\\ \end{aligned}` $$ <img src="index_files/figure-html/unnamed-chunk-7-1.gif" width="50%" /> ] ---
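# Simulating a function vs. a distribution

The distinction on the previous slide is easy to see by simulating both pieces in base R. This is a quick sketch (the seed, sample size, and variable names are my own choices):

```r
set.seed(1)                           # arbitrary seed, for reproducibility
x <- seq(0, 10, length.out = 50)
y_det <- 2 + x                        # the function: no randomness, same y every time
eps <- rnorm(50, mean = 0, sd = 5)    # the stochastic part: sigma^2 = 25, so sd = 5
y_obs <- y_det + eps                  # the distribution: each new draw scatters differently

plot(x, y_obs, pch = 16)              # cloud of points around the line
abline(a = 2, b = 1)                  # the deterministic component never moves
```

Re-run everything below `set.seed(1)` without the seed and the points land somewhere new each time, while the line stays put; that gap is exactly the deterministic/stochastic split.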
--- # Simulate our own model Let's create some linear regression data from the ground up. We need: 1. A functional relationship between two variables. 2. A probability distribution to describe the error. We'll use the normal distribution to start. `$$y_i = f(x_i)$$` `$$\epsilon_i \sim \mathcal{N}(0,\sigma^2)$$` The true statistical model is: `$$y_i = f(x_i) + \epsilon_i$$` or, for simple linear regression, `$$\underbrace{y_i}_{DV} = \underbrace{\beta_0}_{Intercept} + \underbrace{\beta_1}_{slope}\underbrace{x_i}_{IV}+\underbrace{\epsilon_i}_{error}$$` where DV is the dependent variable and IV is the independent variable. --- ## Create a functional relationship .pull-left[ `$$y_i=5+2x_i$$` ] .pull-right[ <div class="figure"> <img src="index_files/figure-html/unnamed-chunk-10-1.png" alt="A beautiful plot, by me. This is the plot [p1]. Notice that this is not a statistical relationship, there is no error." width="70%" /> <p class="caption">A beautiful plot, by me. This is the plot [p1]. Notice that this is not a statistical relationship; there is no error.</p> </div> ] --- ## Generate and plot the error .pull-left[ ```r set.seed(90210) e <- data.frame(error = rnorm(n = 50, mean = 0, sd = 10)) ``` Looks alright. Keep this in mind whenever you check the assumption of normality for real data: it's not always pretty, but sometimes it's good enough. ] .pull-right[ <!-- --> ] --- ## Create the statistical relationship .pull-left[ OK! Let's combine these components to simulate some regression data. .tiny[ ```r summary(lm(yvar~xvar,df2)) ``` ``` ## ## Call: ## lm(formula = yvar ~ xvar, data = df2) ## ## Residuals: ## Min 1Q Median 3Q Max ## -23.0013 -5.5597 0.9162 5.4365 20.6247 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 2.645 2.958 0.894 0.376 ## xvar 2.002 0.101 19.829 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1 ## ## Residual standard error: 10.3 on 48 degrees of freedom ## Multiple R-squared: 0.8912, Adjusted R-squared: 0.8889 ## F-statistic: 393.2 on 1 and 48 DF, p-value: < 2.2e-16 ``` ] ] .pull-right[ <!-- --> ] --- # Ordinary least squares (OLS) vs. Maximum likelihood (ML) We just used OLS; it's a very old technique based entirely on linear algebra. - OLS is just linear algebra and has minimal assumptions - It can be done easily by hand for small data sets - This is what people did back in 1930 - I think it's rarely used on its own now; our modern models are generally too complex - OLS produces the best linear unbiased estimates (BLUEs) as long as some assumptions are met: - The linear model is reasonable - Constant variance - Normality isn't technically an assumption for estimation, but it is for producing intervals - ML is very powerful but also makes a lot of distributional assumptions about the data - We're looking to maximize the likelihood of seeing our data by testing many different models/parameter estimates - It's more like doing a grid search for the most likely parameters to have generated the data you saw - In fact, we can do it "by hand" like that --- # Visualization of a log-likelihood surface (for ML) We fit fancy, complex models with maximum likelihood. But what is this mathematical voodoo? `$$\mathcal{L}(parameters \mid obs)=\prod_{i=1}^n f(obs_i \mid parameters)$$` The log-likelihood is usually used instead for numerical stability. Products change to sums when a `\(\log\)` function is applied. `$$\log{\mathcal{L}(parameters \mid obs)}=\sum_{i=1}^n \log{f(obs_i \mid parameters)}$$` The goal is to maximize the likelihood (or log-likelihood) to find good parameter estimates. 
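---
# A grid search "by hand"

Here is the grid-search idea in miniature: a toy base R sketch (not how `lm()` or `glm()` actually optimize) that scans candidate values of `\(\mu\)` for a normal distribution with known `\(\sigma^2\)` and keeps the most likely one:

```r
set.seed(42)
obs <- rnorm(20, mean = 40, sd = sqrt(10))   # pretend data; we "forget" that mu = 40

mu_grid <- seq(30, 50, by = 0.01)            # candidate values for mu
loglik <- sapply(mu_grid, function(mu) {
  sum(dnorm(obs, mean = mu, sd = sqrt(10), log = TRUE))  # log-likelihood at this mu
})

mu_hat <- mu_grid[which.max(loglik)]         # most likely candidate
c(grid = mu_hat, analytic = mean(obs))       # agree to within the grid step
```

For a normal with known variance, the maximum-likelihood estimate of `\(\mu\)` is exactly the sample mean, so the grid search just rediscovers `mean(obs)` to within half a grid step.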
---
# An example with the normal distribution

.pull-left[
The log-likelihood of seeing a `\(43\)` and a `\(38\)` when we have a normal distribution with `\(\mu =40\)` and `\(\sigma^2=10\)` is:
.tiny[

```r
x <- seq(30, 50, length.out = 100)
hx <- dnorm(x,mean = 40,sd = sqrt(10))
```
]
.tiny[

```r
log(dnorm(x = 43,mean = 40,sd = sqrt(10))) +
  log(dnorm(x = 38,mean = 40,sd = sqrt(10)))
```

```
## [1] -4.790462
```
]
]

.pull-right[
.tiny[

```r
plot(x, hx, type="l", lty=1, xlab="x value", ylab="Density",
     main="Likelihood of x=38 and x=43 when we assume a \n normal distribution with mean=40 and variance=10")
segments(x0 = 43, y0 = 0, x1 = 43, y1 = dnorm(x = 43, mean = 40, sd = sqrt(10)), col="green", lty=2)
segments(x0 = 38, y0 = 0, x1 = 38, y1 = dnorm(x = 38, mean = 40, sd = sqrt(10)), col="green", lty=2)
```

<img src="index_files/figure-html/unnamed-chunk-18-1.png" width="70%" />
]
]

---
# Let's create some non-normal data!

.pull-left[
.tiny[

```r
set.seed(90210)
lldata <- data.frame(xvar=as.numeric(seq(0,5,0.1)))
lldata2 <- lldata %>%
  mutate(lam=50*exp(-.9*xvar)) %>%
  mutate(yvar=rpois(n = nrow(lldata), lambda = lam))
```
]

Here's our equation:

`$$\lambda_i=a\cdot e^{b \cdot x_i}$$`
`$$\lambda_i=50\cdot e^{-0.9 \cdot x_i}$$`
`$$y_i \sim Poisson(\lambda_i)$$`
]

.pull-right[
.tiny[

```r
plot1 <- ggplot(data = lldata2, aes(x=xvar, y=yvar))
plot1 <- plot1 + geom_point()
plot1 <- plot1 + labs(title="Plot of present count by distance from the Christmas tree.",
                      x ="Distance", y = "Presents")
plot1
```

<img src="index_files/figure-html/unnamed-chunk-20-1.png" width="70%" />
]
]
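---
# Building a likelihood search grid (a sketch)

The slides that follow evaluate the Poisson log-likelihood over a grid of candidate `\(a\)` and `\(b\)` values (`search_grid`). The chunk that builds it is hidden, so this is my reconstruction; the grid ranges, step sizes, and the base-R version of the simulated data are assumptions:

```r
set.seed(90210)                        # same seed as the simulation above
xvar <- seq(0, 5, 0.1)
yvar <- rpois(length(xvar), lambda = 50 * exp(-0.9 * xvar))

# every combination of candidate parameter values
search_grid <- expand.grid(a = seq(30, 70, by = 1),
                           b = seq(-1.5, -0.3, by = 0.02))

# Poisson log-likelihood of the data at each (a, b) pair
search_grid$loglik <- apply(search_grid, 1, function(p) {
  lam <- p["a"] * exp(p["b"] * xvar)
  sum(dpois(yvar, lambda = lam, log = TRUE))
})

search_grid[which.max(search_grid$loglik), ]   # should land near a = 52, b = -0.91
```

A finer grid (or `optim()`) would sharpen the estimates; the point is only that maximizing a likelihood is conceptually just this kind of search.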
--> <!-- Now, remember that a model using the log-transformed values is not what we actually want. --> <!-- We're just after starting parameter estimates, and this should get us into the right ballpark. --> <!-- By log-transforming the observations we're trying to approximate the log-transformed response surface. --> <!-- `$$\log{y_i}=\log{a} + bx \approx \log{\hat{y_i}}=\log{a} +bx$$` --> <!-- These equations ARE NOT equal, but maybe we can get good estimates of the parameters. --> <!-- Notice that we have `\(\log{a}\)`, not `\(a\)`. --> <img src="index_files/figure-html/unnamed-chunk-25-1.png" width="90%" /> ] .pull-right[ Looks neat! How does it compare to a real `glm()` fit? .tiny[ ```r fit_test <- glm(yvar~xvar,family=poisson,data=lldata2) coef(fit_test) ``` ``` ## (Intercept) xvar ## 3.9521029 -0.9092647 ``` ] Not bad! Looks like if we ran enough simulations that we'd probably get about the same answer. Note that: ```r exp(3.9521029) ``` ``` ## [1] 52.0447 ``` ] --- # How's it look in 3D? ```r plot_ly(x=search_grid$a, y=search_grid$b, z=search_grid$loglik, type="scatter3d",color = search_grid$loglik) ```
---
# Prediction example: Random forest

Before we talk about random forest, we have to know what a CART model is.

# Classification and regression trees

- They are essentially decision trees
- Regression or classification
- Non-parametric
- Easy to interpret
- We want to avoid overfitting; typically the number of terminal nodes is controlled by a tuning parameter `\(\alpha\)` (just like Ridge, LASSO, and GAM)
- Find the best `\(\alpha\)` for the data using cross validation
- Let's just look at one

---
class: center
# `mtcars` tree `mpg~.`

<!--  -->

---
# Bagging

CART models are easy to interpret, but their predictions are not competitive with other machine learning methods.

Bagging is a general bootstrap resampling method that helps reduce the variance for many machine-learning methods.

To understand what's going on we should remind ourselves about the definitions of bias and variance in a machine learning context.

---
# Bias and variance (machine learning)

- Variance: the amount of change that we see in our model estimates when we change the model training data.
  - Flexible fits tend to have higher variance
- Bias: the error that comes from our model's limited ability to reflect real life.
  - Flexible models/models with more parameters tend to have low bias

---
# This would be a low-variance, high-bias model fit (probably underfit)

.pull-left[
- This is an example of a low-variance model: it's so simple that we would expect the parameter estimates to remain constant over new datasets. 
- It's high bias: it's too simple to adequately explain the phenomenon, so it adds error by constantly being wrong
]

.pull-right[

```r
fit2 <- lm(mpg~wt,mtcars)
plot(mtcars$mpg~mtcars$wt)
abline(fit2)
```

<img src="index_files/figure-html/unnamed-chunk-30-1.png" width="60%" />
]

---
# This would be a high-variance model fit (probably overfit)

.pull-left[
- This is an example of a high-variance model: it's so flexible that it will be very sensitive to changes in the data
- It's low bias: it's flexible enough to model the phenomenon
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-31-1.png" width="75%" />
]

---
# Back to bagging

- If we fit full trees (or just really bushy trees) our tree will have low bias and very high variance.
- The high variance makes it more or less useless for predictions.
- However, we can use bootstrap resampling and build _**lots**_ of bushy trees. (Individual trees are low-bias, high-variance.)
- Then we average the predictions of all trees to produce low-variance predictions.

**Terminology**

Bootstrap resampling resamples the data _**with replacement**_ to create a new sample of the same size as the original sample. We're assuming that the data are representative of the population (which we were doing anyways).

---
# Out of bag error (OOB)

Fun fact: if you bootstrap resample a dataset then, on average, about `\(\frac{2}{3}\)` of the data will be used. This means that for each tree, on average, about `\(\frac{1}{3}\)` of the data goes unused.

It turns out that we can use the unused data as test data (just like cross validation) and generate prediction error estimates for each tree (OOB MSE).

We could use OOB error as a way to choose tuning parameters instead of using cross validation. This is especially useful for very large data sets (cross validation can take days). 
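The "about `\(\frac{2}{3}\)`" figure (more precisely `\(1-e^{-1}\approx 0.632\)` for large samples) is easy to check by simulation in base R:

```r
set.seed(1)
n <- 1000   # pretend dataset size

# for each of 200 bootstrap samples: what fraction of the original rows showed up?
frac_used <- replicate(200, length(unique(sample(n, n, replace = TRUE))) / n)

mean(frac_used)   # about 0.632, so roughly 1/3 of rows are "out of bag" per tree
1 - exp(-1)       # the theoretical large-sample value
```

Each out-of-bag row is a free test point for the trees that never saw it, which is where the OOB error estimate comes from.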
I don't think there are any major tuning parameters for bagging (but there will be for random forest).

---
# Random forest (finally)

Random forest builds on bagging by doing something that may seem strange.

To reduce correlation between trees, random forest samples a random subset of the predictors at each split within each tree. The default size of that subset is `\(\sqrt{p}\)` for classification; however, this is a tuning parameter that can (and should) be optimized.

# Example

Alberta makes its historical wildfire data freely available online here: https://wildfire.alberta.ca/resources/historical-data/historical-wildfire-database.aspx

We can use this data to see if we can predict whether forest fires were caused by lightning or by recreation.

---
class: center
# Predictions: Regularized logistic regression vs. Random Forest

---
class: center
# Variable importance (because we can't plot hundreds of trees)

---
class: center
# Map-based visualization

---
class: center
background-image: url("pictures/end.jpg")

# Discussion and questions